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Abstract 

We propose the use of statistical emulators for the purpose of valuing mortality-linked contracts 
in stochastic mortality models. Such models typically require (nested) evaluation of expected 
values of nonlinear functionals of multi-dimensional stochastic processes. Except in the simplest 
cases, no closed-form expressions are available, necessitating numerical approximation. Rather 
than building ad hoc analytic approximations, we advocate the use of modern statistical tools 
from machine learning to generate a flexible, non-parametric surrogate for the true mappings. 
This method allows performance guarantees regarding approximation accuracy and removes the 
need for nested simulation. We illustrate our approach with case studies involving (i) a Lee- 
Carter model with mortality shocks, (ii) index-based static hedging with longevity basis risk; (iii) 
a Cairns-Blake-Dowd stochastic survival probability model. 

Keywords: Statistical emulation, longevity risk, life annuities, valuation of mortality-contingent 
claims 


1. Introduction 


Longevity risk has emerged as a key research topic in the past two decades. Since the seminal 
work of Lee and Carter [28] there has been a particular interest in building stochastic models of 
mortality. Stochastic mortality allows for generation of a range of future longevity forecasts, and 
permits the modeler to pinpoint sources of randomness, so as to better quantify respective risk. 
Longevity modeling calls for a marriage between the statistical problem of calibration, i.e. fitting 
to past mortality data, and the financial problem of pricing and hedging future longevity risk. 
At its core, the latter problem reduce to computing expected values of certain functionals of the 
underlying stochastic processes. For example, the survival probability for t years for an individual 
currently aged x can be expressed as a functional 


P(t, x) = E 


exp 


— / x + s) ds 

Jo 


(1) 


where p(s, x + s) is the force of mortality at date s for an individual aged x + s. In the stochastic 
mortality paradigm fi(s, x + s) is random for s > 0, and so one is necessarily confronted with the 
need to evaluate the corresponding expectations on the right-hand-side of (1). 

The past decade has witnessed a strong trend towards complexity in both components of (1). 
On the one hand, driven by the desire to provide faithful fits (and forecasts) to existing mortality 
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data, increasingly complex mortality models for n(t, x) have been proposed. The latest genera¬ 
tion of models feature multi-dimensional, nonlinear stochastic state processes driving see 

e.g. Cairns et al. [11], Li et al. [29], Lin et al. [30], Barrieu et al. [2], Fushimi and Kogure [19]. These 
models are effective at calibration and emitting desirable forecasts, but lack tractability in terms 
of closed-form formulas. On the other hand, sophisticated insurance products, such as variable an¬ 
nuities or longevity swap derivatives, make valuation and hedging highly nontrivial, and typically 
call for numerical approaches, as closed-form formulas are not available. Taken together, pricing of 
mortality-linked contracts becomes a complex system, feeding multi-dimensional stochastic inputs 
through a “black box” that eventually outputs net present value of the claim. 

These developments have created a tension between the complexity of mortality models that 
do not admit explicit computations and the need to price, hedge and risk manage complicated 
contracts based on such models. Due to this challenge, there remains a gap between the academic 
mortality modeling and the implemented models by the longevity risk practitioners. Because 
the aforementioned valuation black box is analytically intractable, there is a growing reliance on 
Monte Carlo simulation tools, which in turn is accompanied by exploding computational needs. 
For example, many emerging problems require nested simulations which can easily take days to 
complete. Similarly, many portfolios contain millions of heterogeneous products (see, e.g. Gan 
and Lin [20]) that must be accurately priced and managed. In this article we propose to apply 
modern statistical methods to address this issue. Our approach is to bridge between the mortality 
modeling and the desired pricing/hedging needs through an intermediate statistical emulator. The 
emulator provides a computationally efficient, high-fidelity surrogate to the actual mortality model. 
Moreover, the emulator converts a calibrated opaque mortality model into a user-friendly valuation 
“app”. The resulting toolbox allows a plug-and-play strategy, so that the end user who is in charge 
of pricing/risk-management can straightforwardly swap one mortality model for another, or one 
set of mortality parameters for an alternative. This modular approach allows a flexible solution to 
robustify the model-based longevity risk by facilitating comparisons of different longevity dynamics 
and different assumptions. 

Use of emulators is a natural solution to handle complex underlying stochastic simulators and 
has become commonplace in the simulation and machine learning communities [36, 26]. Below we 
propose to apply such statistical learning within the novel context of insurance applications. In 
contrast to traditional (generalized) linear models, emulation calls for fully nonparametric mod¬ 
els, which are less familiar to actuaries. To fix ideas, in this article we pursue the problem of 
pricing/hedging vanilla life annuities, a foundational task in life insurance and pension plan man¬ 
agement. Except in the simplest settings, there are no explicit formulas for annuity values and con¬ 
sequently approximation techniques are already commonplace. Looking more broadly, our method 
would also be applicable in many other actuarial and risk management contexts, see Section 7. 

The paper is organized as follows: In Section 2 we introduce the emulation problem and re¬ 
view the mathematical framework of stochastic mortality. Section 3 discusses the construction of 
emulators, including spline and kriging surrogates, as well as generation of training designs and 
simulation budgeting. The second half of the paper then presents three extended case studies 
on several stochastic mortality models that have been put forth in the literature. In Section 4 
we examine a Lee-Carter model with mortality shocks that was proposed by Chen and Cox [14]; 
Section 5 studies approximation of hedge portfolio values in a two-population model based on the 
recent work by Cairns et al [13]. Lastly, Section 6 considers valuation of deferred annuities under 
a Cairns-Blake-Dowd (CBD) [9] mortality framework. 
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2. Emulation Objective 

We consider a stochastic system with Markov state process Z = (Z(t)). Throughout the paper 
we will identify Z with the underlying stochastic mortality factors. In Section 2.2 we review 
some of the existing such models and explicit the respective structure of Z. Typically, Z is a 
multivariate stochastic process based on either a stochastic differential equation or time-series 
ARIMA frameworks. For example, Z may be of diffusion-type or an auto-regressive process. 

In the inference step, the dynamics of Z are calibrated to past mortality data that reflect as 
closely as possible the population of interest. In the ensuing valuation step, the modeler seeks 
to evaluate certain quantities related to a functional F(T,Z(-)) looking into the future. The 
time horizon T > 0 allows consideration of deferred contracts that are common in longevity risk, 
see below. Our notation furthermore indicates that F potentially depends on the whole path 
{Z(t),t > T}, such as 

OO 

E(T,Z(0)=exp(-J>(Z(i))), (2) 

t=T 

for some h(z). Given F. the most common aim is to compute its expected value, based on the 
initial data at t = 0, 


E[F(T,Z(-))|Z(0)]. (3) 

Other summary statistics of interest in actuarial applications include the 

• Quantile g(a; F(T, Z(-))) (eg. the Value-at-Risk at level a of F); 

• Expected Shortfall of F, E [F(T, Z(-)) | F(T, Z(-)) < q(a; F(T, Z(■))), Z(0)]; 

• Correlation between two functionals, Corr(Fi(T, Z(-)), ^(T, Z(-))|Z(0)). 

To fix ideas we henceforth focus on (3) which is a fundamental quantity in pricing/hedging 
problems. When T > 0, the evaluation of (3) can be broken into two steps, namely first we 
evaluate 


f(z)=E[F(T,Z(.))\Z(T) = z], (4) 

and then use the Markov property of Z to carry out an outer average, 

E[F(T,Z(-))|Z(0)]= [ f(z)p T (z\Z(0))dz, 

JR d 

where pt{z'\z) = P(Z(T) = z'\Z(0) = z) is the transition density of Z over [0, T]. 

Crucially, because the form of F(T, Z(-)) is nontrivial, we shall assume that f(z ) is not available 
explicitly, and there is no simple way to describe its functional form. However, since f(z ) is a 
conditional expectation, it can be sampled using a simulator, i.e. the modeler has access to an engine 
that can generate independent, identically distributed samples F(T, Z^ n \-)), n = 1,..., given Z(0). 
However this simulator is assumed to be expensive, implying that computational efficiency is desired 
in using it. 
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Given an initial state Z( 0), a naive Monte Carlo approach to evaluate (3) is based on nested 
simulation. First, the outer integral over pt(z\Z(0)) is replaced by an empirical average of (4) 
across m = 1,..., N out draws ~ Z(T)\Z(0), 


E[F(T,Z(-))|Z(0)]~ 


1 

N ou t 


N 0 ut 

E 

m =1 


( 5 ) 


Second, for each z^N the corresponding inner expected value /( z ( m )) is further approximated via 





Min 

£>(T, *("*)-"(■)), 

n= 1 


?7l — 1, . . . , N ou t , 


( 6 ) 


where z^ rn ' ),n (t),t > T are 7V m independent trajectories of Z with a fixed starting point z^ m ^ ,n (T) = 
2 ;( m ). This nested approach offers an unbiased but expensive estimate. Indeed, the total simu¬ 
lation budget is 0(N out ■ Ni n ) (where the usual big-Oh notation h(x) = O(x) means that h(-) is 
asymptotically linear in x as x — > oo) which can be computationally intensive - for example 1,000 
simulations at both steps requires 10 6 total simulations. 

For this reason, it is desirable to construct cheaper versions of approximating (3). The main 
idea is to replace the inner step of repeatedly evaluating f(z) (possibly for some very similar values 
of z) with a simpler alternative. One strategy is to construct deterministic approximations to (4) 
by replacing the random variable Z(s)\Z(T), s > T with a fixed constant, e.g. its mean, which 
can then be plugged into F to estimate the latter’s expected value. This effectively removes the 
stochastic aspect and allows to obtain explicit approximations to /(•). (The simplest approximation 
is to simply freeze Z(s) = Z(T)Vs > T .) However, the resulting error is hard to judge, and 
moreover, analytic, off-line derivations are needed to obtain a good approximation. Consequently, 
we advocate the more statistical method of utilizing a surrogate model for /(•). This approach can 
be generically used in any Markovian setting, requires no analytic derivations, and makes minimal 
a priori assumptions about the structure of /(•). 

The main idea of emulation is a regression framework that generates a fitted /(•) by solving 
regression equations over a training dataset {z^ n \ F(T, 2^(‘))}^=l °f size Nt. r - Emulation reduces 
approximating /(•) to the twin statistical problems of (i) experimental design (generating the 
training dataset) and (ii) regression (specifying the optimization problem that the approximation 
/ solves). Details of these steps are presented in Section 3 below. 

Because we are fitting a full response model, rather than a pointwise estimate, the emulator 
budget Nf r 3> Ni n will be an order of magnitude bigger than in (6). It will also require regression 
overhead. However, once / is fitted, prediction of f(z) for a particular value z takes 0(1) effort, 
so that we can use (5) to estimate the original problem in (3) at a cost linear in N ou i. To sum up, 
the total budget of the emulator is just 0(N tr + N out ), much smaller than 0(N out x !Vj n ) of nested 
Monte Carlo. These savings become even more significant as the dimension of state Z grows. 
Indeed, with multi-dimensional models, both N ou t and N in need to be larger to better cover the 
respective integrals over M d , and hence the efficiency of nested simulations will deteriorate quickly. 
Intuitively, the latter computational budget is at least quadratic in d. In contrast, the intuitive 
complexity of an emulator is linear in d. As stochastic mortality models become more complex, 
models with d = 3,4, 5+ factors are frequently proposed, and efficiency issues become central to 
the ability of evaluating (3) tractably. 
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2.1. Valuation of Life Annuities 

In longevity modeling, Z represents the stochastic factors driving the central force of mortality 
m(t,x). Formally, Z = ( Z(t )) = ... , Zd(t)) is a d— dimensional (P(t)) measurable Markov 

process on a complete filtered probability space (11, P, P, (P(t))). The filtration (P(t)) is the 
information up to time t of the evolution of the mortality processes. 

A typical state-of-the-art model decomposes m(t, x) into a longevity trend, an Age effect, and 
a Cohort effect (known collectively as APC models). Each of the above may be modeled in turn 
by one or more stochastic factors. The most common models are the Lee-Carter [28] and CBD [9] 
models and their generalizations. Generally their individual components follow an ARIMA model; 
details can be found in the survey Cairns et al. [11]. 

To deal with cashflows at different dates, we assume the existence of a risk-free asset and 
denote by B(T,T + s ) the price of an s-bond at date T with maturity at T + s. For the rest 
of the article we will assume constant force of interest r, leading to B(T,T + s) = e~ rs . One 
can straightforwardly handle stochastic interest rates (which then form part of Z(-)); see Jalen 
and Mamon [25] for a discussion of correlation structure between mortality and interest rates and 
Fushimi and Kogure [19] for an example that applies Bayesian methods to longevity derivative 
pricing under a Cox-Ingersoll-Ross interest rate model. 

Consider an individual aged x at time 0 whose remaining lifetime random variable is denoted 
as t x . The state process Z captures m(t,x + 1 ), the mortality rate process for t x at time t, when 
the individual would have aged to x + t. For small dt, the instantaneous probability of death is 
approximately m(t,x + t)dt, so that the random survival function of t x is 


S(t, x) = exp 


/ m(s,x + 

J o 



( 7 ) 


More generally for u < t < T, the probability of an individual aged x to survive between dates t 
and T, given the information at time u is given by 


P(t x > T I T x > t, P u ) = E 

r s ( t , X ) 

Pu 



S(t , x) 




/ 

rT 

\ 

- 

= E 

exp ( - 

j m(s,x + s)J 

Z(u) 


P(Z(u);t,T,x ), 


( 8 ) 


where the last equality follows from the Markov property. The deterministic analogue of P(Z(0);t, T, x) 
in actuarial literature is r-tPx+t- 

As a canonical actuarial contract, we henceforth focus on deferred life annuities. These con¬ 
tracts are fundamental to valuation of defined benefit pension plans, which normally begin paying 
annuitants at retirement age (typically age 65) and continue until their death, possibly with sur¬ 
vivor benefits. (For valuation purposes the payment is assumed to end at some pre-specified upper 
age x, e.g. 100 or 110). A major problem of interest is valuing such life annuities for current plan 
participants who are still working, i.e. under age 65. Because this requires making longevity pro¬ 
jections many decades into the future, longevity risk becomes a crucial part of risk management. 
The net present value of a life annuity at date T is 


i{Z(T)- T, x) = J2 B(T, T + s)P(r a; > T + s \ P T ) = ^ e~ rs P(Z(T)-T,T + s, x), (9) 


S=1 


S=1 
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where we emphasize that the random mortality shocks come from Z. Finally, the net present 
value at t = 0 is NPV = K[e~ rT ■ a(Z(T),T,x)\, which can be seen as an instance of (3) that 
includes discounting and integrating over the density of Z(T). Except for the simplest models, the 
survival probability P(z ; •) is not analytically known and hence neither is (9) or NPV. Without a 
representation for z H > a(z, T, x ) one is then forced to resort to approximations for all the basic 
tasks of pricing, hedging, asset liability management, or solvency capital computation. The dis¬ 
cussed nested simulation takes the form of first approximating a(z^\ T , x) for some representative 
scenarios (z^\ ..., z^), and then further manipulating the resulting “empirical” distribution of 
(a(z^\ T,x),. .., a(z( n \T, x)). Emulation provides a principled statistical framework for optimiz¬ 
ing, assessing and improving such two-level simulations. 

Remark. As mentioned, estimation of a(-,T,x) is usually a building block embedded in a larger 
setting which requires repeated evaluation of the former quantity. For instance, Bauer et al. 
[4] addresses nested Monte Carlo simulations in calculating the present value of life-annuity-like 
instruments in the calculation of solvency capital requirements. Let us also mention the works 
Bacinello et al. [1], Boyer and Stentoft [6] who considered valuation of mortality contracts with 
early exercise features, such as surrender guarantees. The respective least squares Monte Carlo 
algorithms can be seen as classical parametric linear-model emulators in our terminology, eschewing 
the need for nested Monte Carlo forests. 


2.2. Stochastic Mortality 

We concentrate on discrete-time mortality models which are easier to calibrate to the discrete 
mortality data, typically aggregated into annual intervals. The common assumption is that the 
central force of mortality remains constant through a given calendar year, so that for all 0 < s, u < 
1, we have m(t + s, x + u) = m(t , x). Therefore 


P{Z(u ); t, T,x) = E 

/ T \ 

exp — ^2 m(s, x + s) 

Z{u) 


\ s=t+l / 



( 10 ) 


Thus, P{Z(T ); T,T + s, x + T) becomes a functional of the trajectory of Z. 

Three major approaches to stochastic mortality have been put forward in the literature. The 
first approach, pioneered by Lee and Carter [28], directly treats m(t,x ) as a product of individual 
stochastic processes, e.g. ARIMA time-series. This setup allows incorporating demographic in¬ 
sights, as well as disentangling age, period and cohort effects in future forecasts. To wit, the popular 
age-period-cohort (APC) mortality models assume that (see Appendix A for more details) 

log m(t,x) = /3^ + — ^ 2 \t) + — 7 ® (t — x), ( 11 ) 

n a n a 

where and 7 ^ are stochastic processes and n a is the number of ages that x can take in fitting. 
In this case, the state process Z(t) depends on current and potentially past values of k ^ and 
Attempts to understand the statistical validity of such models have been done by, for example, Lee 
and Miller [27], [ 8 ], Booth et al. [5], Czado et al. [16], Delwarde et al. [18], and Li et al. [29]. In 
addition, there have been several extensions of the Lee Carter model by Renshaw and Haberman 
[34], Hyndman and Ullah [24], Plat [32], Debonneuil [17], and Cairns et al. [12]. 

None of these models admit closed form expressions for survival probabilities P(z;-). Con¬ 
sequently, several authors have proposed approximation methods. Coughlan et al. [15] used a 
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bootstrapping approach, while Cairns et al. [13] derived an analytic approximation, commenting 
that industry practice is to utilize deterministic projections. The more flexible tool of Monte Carlo 
simulation has been applied in Bauer et al. [4] among others. 

The second approach, due to Cairns et al. [9] (CBD), generates a stochastic model for the 
survival probability (10), allowing for straightforward pricing of longevity-linked products; however, 
it is more difficult to calibrate and to obtain reasonable forecasts for future mortality experience 
in a population as a whole. The third approach works with forward mortality rates [3], borrowing 
ideas from fixed income markets. Forward models give a holistic view of how the mortality curves 
can evolve over time, and presents a dynamically consistent structure for mortality forecasting. 
Once again however, they do not provide easy expressions for (10) and hence require further 
manipulation for pricing purposes. 


2.3. Bias/Variance Trade-Off 

With a view towards approximating (9), it is imperative to first quantify the resulting quality 
of an approximation. The standard statistical approach is to use the framework of mean squared 
error. Fix z and let a{z) = a(z, T, x ) be the true value of a life annuity conditional on state 
Z(T) = z. If a(z) is being estimated by a(z), then 

IMSE(d) = E [(d — a) 2 ] , Bias(a) = E [a — a], (12) 


where the averaging is over the sampling distribution (i.e. different realizations of data used in 
constructing it) of a(z). 

Starting with (12) leads to the fundamental bias/variance trade-off. At one end of the spectrum, 
a Monte Carlo estimate as in (6) has zero bias but carries a high variance. At the opposite end, an 
analytic approximation has zero variance, but will have a non-zero bias that cannot be alleviated 
(whereas the Monte Carlo IMSE will go to zero as the size of the dataset grows Af r — > oo) even 
asymptotically. Because low variance is often preferred practically, analytic methods have remained 
popular. Cairns et al. [13] echoes that it is usual practice in industry to use a deterministic 
projection of mortality rates rather than use a simulation approach. 

The basic idea for the deterministic approximations is that if rh{t , x) is an unbiased estimate 
for m(t,x), then 


P(Z(u);t, T,x) = E 


exp — 22 m(s, x + , 


s=t+l 


Z(u) 


exp 


'22 E [m(s, x + s) | Z(u)]^ = exp (— 22 x + s \ -Z’('u))'] • (13) 

S=t-\~1 / \ S=t+1 / 


Using the estimate for P(Z(u);-) in (13) one can then approximate a(Z(T),T,x ) term-wise. 
Jensen’s inequality implies that exp ff2=t+i x + s )) > P(Z(u);t , T, x). Consequently, any 
such approximation is guaranteed to be biased high for the survival probabilities (and subsequently 
the annuity values). 

Analytic approximations can be very powerful and of course very fast but they carry two major 
disadvantages. One is the need to derive a suitable estimator rh. This may be possible in a simple 
model (e.g. low-dimensional Z with linear dynamics, like in the original Lee-Carter model), but 
otherwise may require a lot of off-line labor, leading to unnecessary focus on simplifications at the 
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expense of calibration and risk management consistency. Second, the degree of accuracy of the 
approximation is unknown. Indeed, there is generally not much that is available about empirical 
accuracy of the right-hand-side in (13) for a given model, leaving the user in the dark about how 
much error is being made. This issue is very dangerous, since potentially major mis-valuations 
may creep up unbeknownst to the risk manager. 

To remedy the above shortcomings, while still maintaining significant variance reduction com¬ 
pared to plain MC, we advocate the use of statistical emulators. The latter offer posterior quan¬ 
tification of accuracy (via standard error or Bayesian posterior variance), and do not require any 
simplifications of the mortality model. An additional advantage is that one can directly approxi¬ 
mate z H > a(z, T, x) without having to do intermediate approximations of the survival probabilities 
(which inevitably lead to further error compounding). As we demonstrate in the case studies, sta¬ 
tistical models for a(z) can indeed efficiently address the bias/variance trade-off, by maintaining 
negligible bias and small variance, leading to improved IMSE metrics compared to other approaches. 

3. Statistical Emulation 

The idea of emulation is to replace the computationally expensive process of running a Monte 
Carlo sub-routine to evaluate f(z ) for each new site z with a cheap-to-evaluate surrogate model 
that statistically predicts f(z) for any z £ based on results from a training dataset. At 
the heart of emulation is statistical learning. Namely, the above predictions are based on first 
obtaining pathwise estimates y^ = F(T , z ^), n = 1,..., N tr for a set of training locations, called 
a design V = (z^\ ..., z^ Ntr ^). Next, one regresses {y( n )} against { z to “learn” the response 
surface /(•). The regression aspect allows to borrow information across different scenarios starting 
at various sites. This reduces computational budget compared to the nested simulation step of 
independently making Nf r pointwise estimates f(z^) by running N vn scenarios from each site 
z^ n \ The conceptual need for regression is two-fold. First, the emulator is used for interpolation, 
i.e. using existing design to make predictions at new sites z. In contrast, plain Monte Carlo only 
predicts at z^’s. Second, like in the classical approach, the emulator smoothes the Monte Carlo 
noise from sampling trajectories of {Z(s), s > T}. 

Formally, the statistical problem of emulation deals with a sampler (or oracle) 

Y(z) = f(z) + e(z), (14) 

where we identify f(z) = a(z , T, x) with the unknown response surface and e is the sampling noise, 
assumed to be independent and identically distributed across different calls to the oracle. We 
make the assumption e(z) ~ IV(0, t 2 (z)), where t 2 (z) is the sampling variance that depends on 
the location z. Emulation now involves the (i) experimental design step of proposing a design 
V that forms the training dataset, and (ii) a learning procedure that uses the queried results 
(z( n ), y( n ))^Ti, with the y being realizations of (14) given z^ n \ to construct a fitted response 
surface /(•). The fitting is done by specifying the approximation function class f &H, and a loss 
function L(/, f ) which is to be minimized. The loss function measures the relative accuracy of / 
vis-a-vis the ground truth; in this paper we focus on the mean-squared approximation error 

L{f,f)=[ \f(z)~ f{z)\ 2 dz. (15) 

J R d 

Because the true / is unknown, the definition of L(f, f ) cannot be operationalized and instead 
a proxy based on the uncertainty (such as Bayesian posterior uncertainty or standard errors) 



surrounding / is applied. Also, since the structure of / is unknown, it is desirable that the 
approximation class R is dense, i.e. has a sufficiently rich architecture to approximate any / to an 
arbitrary degree of accuracy. To this end, we concentrate on kernel regression methods, namely 
linear smoothers. In the next subsections we introduce two such regression families, smoothing 
splines and kriging (Gaussian process) models. 

Remark. In this paper we focus on the original task of producing an accurate approximation to / 
everywhere. In some contexts, accuracy is judged not globally, but locally, so that a differentiated 
accuracy measure is used. For example, in VaR applications, the model for / must be accurate 
in the left-tail, but can be rather rough in the right-tail. In this case, (15) can be replaced by a 
weighted loss metric. 

3.1. Emulators based on Spline Models 

We generate emulators /(•) using a regularized regression criterion. To wit, given a smoothing 
parameter A > 0 we look for the minimizer / G R of the following penalized residual sum of squares 
problem 


N tr 

RSS(f, A) = £> (n) - /C* (n) )} 2 + AJ(/), (16) 

n=l 

where </(/) is a penalty or regularization function. We concentrate on the case where the approxi¬ 
mation class has a reproducing kernel Hilbert space (RKHS) structure which also generates «/(/). 
Namely, there exists an underlying positive definite kernel C(z,z') such that Re = span(C(-, z) : 
z G M rf ) is the Hilbert space generated by C and J(f) = ||/||% c - The representer theorem implies 
that the minimizer of (16) has an expansion in terms of the eigen-functions 


Ntr 

f( z ) = ( 17 ) 

3 =1 

relating the prediction at z to the kernel function sampled at the design sites z^\ 

Our first family are smoothing (or thin-plate) splines that take 


Af) = 


cL 


E 


d_d_ 

dzi dzj 


m 


dz, 


(18) 


and R as the set of all twice continuously-differentiable functions. It is known [21, Chapter 5] that 
in this case the underlying kernel is given by C(z, z') = ||z — V|[ 2 log ||z — 2 / ||, where |j • || denotes the 
Euclidean norm in R rf . The resulting optimization of (16) along with (18) gives a smooth response 
surface which is called a thin-plate spline (TPS), and has the explicit form 


Ntr 

f{z) = /3 0 + f3 T z + 

3 = 1 


at 


I z — z^ I 


; log 


\z-z® I 


(19) 


with (3= (P i,.. .,Pd) T - 
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( 20 ) 


In 1-d, the penalized optimization reduces to 

Ntr /* 

inf J> (n) - f(z W)} 2 + A / {f'\u)} 2 du. 
fee- ^ J R 

The summation in (20) is a measure of closeness of data, while the integral penalizes the fluctuations 
of /. Note that A = oo reduces to the traditional least squares linear fit f(z) = /?o + Piz since it 
introduces the constraint f"(z) = 0. It is well known that the resulting solution is an expansion in 
terms of natural cubic splines, i.e. / is a piecewise cubic polynomial that has continuous first and 
second derivatives at the design sites z^ n \ and is linear outside of the design boundary. 

Several methods are available to choose the smoothing parameter A, including cross-validation 
or MLE Hastie et al. [21, Chapter 5]. A common parametrization is through the effective degrees of 
freedom statistic df^. We use the R package “fields” [31] to fit multi-dimensional thin plate splines, 
and the base smooth, spline function for the one-dimensional case. 


3.2. Kriging Surrogates 

A kriging surrogate assumes that / in (14) has the form 

f{z) = n(z) + X(z), (21) 

where p : M. d —> M is a trend function, and A is a mean-zero square-integrable process. Specifically, 
A is assumed to be a realization of a Gaussian process with covariance kernel C. The role of C 
is identical to the regularized regression above, i.e. C generates the approximating family He that 
X is assumed to belong to. 

However, kriging also brings a Bayesian perspective, treating A as a random function to be 
learned, and estimation as computing the posterior distribution of A given the collected data 
y = (y^\ ■ ■ ■ > y^ Ntr ^)- The RKHS framework implies that the posterior mean (more precisely 
its maximum a posteriori estimate) of A (z) coincides with the regularized regression prediction 
from the previous section. In the Bayesian framework, C is interpreted as the covariance kernel, 
C(z,z') = Cov(f(z),f(z')) as /(•) ranges over Tic- Assuming that the noise e(z) is also Gaussian 
implies that A(z)|y ~ N(m(z ), s 2 (z)) has a Gaussian posterior, which reduces to computing the 
kriging mean m(z) and kriging variance s 2 (z). 

In turn, the kriging variance s 2 (z) offers a principled empirical estimate of model accuracy, 
quantifying the approximation quality. In particular, one can use s 2 (z) as the proxy for the MSE 
of / at z. Integrating s 2 (z^) over the outer design locations then yields an assessment regarding 
the error of (3). 


3.2.1. Simple Kriging 

Simple kriging (SK) assumes that the trend fi(z) is known. By considering the process f(z) — 
n(z), we may assume without loss of generality that f(z) is centered at zero and p = 0. The 
resulting posterior mean and variance are then [35] 

f m SK {z ) = c(z) 7 C _1 y; 

\ 4k(z) = C (z, z) ~ c{z) T C- l c{z ), 


( 22 ) 


where c (z) = [C(z,z^)) 


l<n<Nti 


and 


C = 


C(zW,z®j 

10 


J 1 <l,j<Ntr 


+ A, 


(23) 



with A the diagonal matrix with entries r 2 ^ 1 )), ...,r 2 (z( Ntr '>). 

3.2.2. Universal Kriging 

Universal kriging (UK) generalizes (21) to the case of a parametric trend function of the form 
p(z) = /?o + Ylj=i Pjhj( z ) where fij are constants to be estimated, and hj{-) are given basis 
functions. The coefficient vector (3 = (/3i ,..., (3 P ) T is estimated simultaneously with the Gaussian 
process component X(z). A common choice is first-order UK that uses hj(z ) = Zj for j = 1,..., d. 
Another common choice is zero-order UK, also known as Ordinary Kriging (OK) that takes p(z) = 
/3o a constant to be estimated. 

If we let h(z) = (hi(z ),..., h p (z )) and H = (h(z^ 1 )),..., h( 2 :^)) , then the universal kriging 
mean and variance at location z are [35] 

f m UK (z) = h(z) T /3 + c(z) T C -1 (y - H/3); 

l = s 2 SK (z) + (h(z) T - c(z) t C -1 H) T (H^C-^H)- 1 (h (z) T - C (zf C-'H) , 

where the best linear estimator of the trend coefficients f3 is given by the usual linear regression 
formula $ = (H t C -1 H) -1 H T C' 1 y. 

The combination of trend and Gaussian process (GP) model offers an attractive framework 
for fitting a response surface. The trend component allows to incorporate domain knowledge 
about the response, while the GP component offers a flexible nonparametric correction. One 
strategy is to specify a known trend (coming from some analytic approximation) and fit a GP 
to the residuals, yielding a Simple Kriging setup. Another strategy is to take a low-dimensional 
parametric approximation, such as a linear function of Z-components, and again fit a GP to the 
residuals, leading to a Universal Kriging setup. 

3.2.3. Covariance kernels and parameter estimation 

The covariance function is a crucial part of a Kriging model. In practice, one usually 

considers spatially stationary or isotropic kernels, 


d 

c(z, z') = c(z - z') = cr 2 n 9(( z - z ')u 

j =i 

reducing to the one-dimensional base kernel g. Below we use the power exponential kernels 
g(h‘,9 ) = exp The hyper-parameters Oj are called characteristic length-scales and 

can be informally viewed as roughly the distance you move in the input space before the response 
function can change significantly, Rasmussen and Williams [33, Ch 2]. The user-specified power 
p G [1, 2] is usually taken to be either p = 1 (the exponential kernel) or p = 2 (the Gaussian 
kernel). Fitting a kriging model requires picking a kernel family and the hyper-parameters a 3 .9j. 
Two common estimation methods are maximum likelihood, using the likelihood function based on 
the distributions described above, and penalized MLE (PMLE). Either case leads to a nonlinear 
optimization problem to fit Oj and process variance cr 2 . One can also consider Bayesian Kriging, 
where trend and/or covariance parameters have a prior distribution, see Helbert et al. [22]. We uti¬ 
lize the R package “DiceKriging” [35] that allows fitting of SK and UK models with five options for 
a covariance kernel family, and several options on how the hyper-parameters are to be estimated. 
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3.2-4- Batching 

To construct an accurate emulator for /(•), it is important to have a good estimate of the 
sampling noise t 2 (z). Typically this information is not available to the modeler a priori. One 
of the advantages of plain nested Monte Carlo is that generating 7Vj n scenarios from a fixed z^ 
gives natural empirical estimates both for /( z and t 2 (z^). To mimic this feature, we therefore 
consider batched or replicated designs V. To wit, given a total budget of Nt r = Nt r , 1 • Nt r ,2 training 
samples, we allocate them into N tj - } 1 distinct design sites z^\ ..., z^ Ntr ’ 1 \ and then generate N tT} 2 
trajectories from each z^ n \ Next, the above batches are aggregated into 

Mr,2 

y {n) = T^£ F < r ' 2<n)J '(')>; (25) 

N,r,2 

Nfr, 2 

f 2 (2 ( ">) = E {» (n) - F(T, *<"«(•))} , (26) 

r ’ J=1 

and the resulting dataset {z^ n \ y^ n \ f 2 (z^ n ' 1 )}, n = 1 ,..., N tr j is used to fit a kriging model for /, 

with T 2 (z^)/N t r ,2 proxying the simulation variance at 

The efficient allocation between Nt r 1 and A ^ r ,2 was analyzed in Broadie et al. [7] for a related 
risk management problem and it was shown that the optimal choices satisfy 

Ntr,i oc N 2 J\ N tr , 2 oc Nl /3 . (27) 

This is also the allocation we pursue in this paper, so that there are relatively many more design 
sites than replications in each batch. 

3.3. Experimental Design 

Several approaches are possible for constructing the training design V. First, one may generate 
an empirical design by independently sampling z( n l ~ Z(T)\Z(0). This allows to emulate the 
conditional density pr(z\Z(0)) which is advantageous for computing an expectation like in (3). 
Second, one may generate a random V using some other proposal density z^ ~ Q. For example, a 
uniform proposal density (i.e. z^ i.i.d. uniform in some domain D C R rf ) yields a basic attempt in 
having a space filling experimental design of arbitrary size. A more structured (but still random) 
design can be obtained via Latin Hypercube Sampling (LHS) techniques [38]. Roughly speaking, 
LHS builds a regular d-dimensional lattice and then attempts to equidistribute N tr ^ sites among 
the resulting hypercubes. Within each selected hypercube the design site is placed uniformly. 

Third, one can use a deterministic design, such as a latticed grid, or a quasi-Monte Carlo 
(QMC) sequence. Deterministic designs ensure a space-filling property and easy reproducibility. 
For example, the Sobol sequence [37] redistributes a uniform binary grid to produce a grid that is 
maximally equidistributed. Compared to LHS, use of QMC is faster (as it can be directly hard¬ 
coded) and can be manually tweaked as needed. Both methods reduce Monte Carlo variance of / 
relative to empirical T>. Theoretically, the typical domain of Z(T) is unbounded, e.g. M+ d . This 
is not an issue for empirical design construction; for LHS and QMC methods, one must specify an 
appropriate bounding domain D € before generating T>. 

Remark. Depending on the context, the design V might need to be spatially non-uniform. For 
example, if using a deterministic design for computing (3), it may be preferable to capture the 
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correlation structure among the components of Z(T), or to up-weigh the regions most likely for 
Z[T ). If one is estimating a quantile or tail expectation, V should preferentially cover the extreme 
values of the distribution of Z(T ); in that situation, an empirical design would be inappropriate. 

3.3.1. Generating Longevity Scenarios 

Construction of an emulator entails the basic building block of generating a longevity scenario 
{Z(t),t = 0,..In the simplest setting, this just requires to generate and manipulate a sequence 
of i.i.d Uniform draws that describe the random increments of the (components) of Z. However, 
typically the model used also includes parameters that must be estimated or calibrated. This 
aspect becomes nontrivial when future longevity projections are made, whereby model re-fitting 
may be carried out. Re-fitting introduces path-dependency, making parameters dynamic quantities 
that might need to be included in Z. For example, Cairns et al. [13] advocate the PPC (partial 
parameter certain) scenario generation that breaks the overall simulation into two pieces of [0, T] 
and [T, oo). With PPC, one initially calibrates the model at t = 0 using past mortality data and 
then simulates up to time T. The simulated scenario is then appended to the historical data, so 
that the simulation becomes the new “history” from time 0 to time T. The model parameters are 
then re-fitted at T and the resulting, modified longevity dynamics of Z are used to simulate beyond 
T. The idea of PPC is to capture some memory of mortality evolution, in essence removing some 
of the presumed Markovian structure. Under PPC the refitted parameters are blended into Z(T ) 
since they affect the resulting F(T. Z(-)). 

Also, in the interest of dimension reduction, one could drop some components of the full state 
space when constructing the emulator. To do so, one may analyze what dynamic variables mate¬ 
rially impact annuity values, for example via some simple regression models to test for statistical 
significance. 


3-4 • Fitting and Evaluation of Emulators 

To fit an emulator for a given simulation budget JV( r , we first decompose A) r = A)r, l x Nt r ,2 
and then construct an experimental design V of size Wr,i using one of the methods in Section 3.3. 
Each site in V then spawns Wr ,2 trajectories that are batched together as in (25). Fitting is done 
in R using the mentioned publicly available packages. For kriging, we use the default setting of the 
km function in DiceKriging package. 

Given a(z, T, x ) we evaluate its performance across a test set V test = (z^\ ..., z^ Nou *i) of N out 
locations. Note that T> test is distinct from the training set T>. In line with (3) we use an empirical 
testing set ~ Z(T)\Z(0). Since the true values a(z, T, x) are not available, we benchmark 

against an (expensive) gold standard estimate a MC (z,T,x) that is described below. In particular, 
we record the integrated MSE and Bias statistics from (12), namely 


IMSE(a) 


Bias(a) 


1 


N 0 ut 


N OU t 

1 


E (a(zW,T,x)-a MC ( Z W,T,xj) ; 


n= 1 
N 0 ut 


N, 


out 


E [a(z^ n \T,x)-a MC (z^\T,x) 


n= 1 


(28) 

(29) 


For benchmarking, we use a high-fidelity nested Monte Carlo approach (5)-(6). While expensive, 
it is a simple, asymptotically consistent, unbiased estimator. Specifically, for valuing annuities, 
a MC (z,T,x ) is obtained by averaging W OT = 1000 scenarios of {Z(s),s > T} at each z^ 1 ) g x> test . 
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Unless indicated otherwise, we use N out = 1000, so that the overall budget of a MC is 0(N out xN out ). 
We then compare against emulators that use Nt r G [100,8000], which yields an efficiency gain on 
the order of 10-50x speed-up. We also compare against deterministic estimators that require 
no training at all (but do need an analytic derivation), and take just 0(N out ) budget to make 
predictions for the outer N out simulations to evaluate (28). 

4. Case Study: Predicting Annuity Values under a Lee-Carter with Shocks Framework 

Chen and Cox [14] introduced a mortality model based on the traditional Lee Carter set-up: 

log m(t,x) = ^ l \x) + /3^ 2 \x) K^ 2 \t). (30) 

This is the same as the APC model (M2) in Appendix A without the cohort term. In the Chen-Cox 
model, and fd^ 2 \x) are deterministic vectors capturing age effects, and re^(f) is a stochastic 

process capturing the period effect with dynamics 

K (2) (f + 1) = re ( 2 ) (; t) + p {1) + £ (1) (f + 1) + [£ (2) (f + 1) - £ ( 2 ) (t)], (31) 

where £^(t) ~ A(0, a^) and £^(t) has an independent zero-modified normal distribution with 
P(£( 2 )(t) = 0) = 1 — p, and Gaussian parameters (p^ 2 \a^). The motivation for (31) is to incor¬ 
porate idiosyncratic mortality shocks represented by f^ 2 \ that occur with probability p any given 
year and have a random magnitude with distribution N(p^ 2 \ a ^). Such shocks, representing 
natural or geopolitical catastrophes, are temporary and last just a single period, hence subtrac¬ 
tion of the last term —^ 2 \t) in (31). Due to this term, it would appear that the model has a 
two-dimensional state space {^^(t), However, we note that it is sufficient to generate 
scenarios starting with k^ 2 ^(T) and assuming ^ 2 -*(T) = 0 (no shock in year T). Then after esti¬ 
mating /(re) = K[F(T, K;( 2 )(-))|fv( 2 )(T) = re,£( 2 )(T) = 0], one easily obtains in case of year-T shocks 
E[F(T, k( 2 )(-)|k( 2 )(T) = re, £^(T) = £] = /(re — £), reducing to the prediction of “unshocked” 
values. 

The presence of idiosyncratic shocks in m(t, x) renders the corresponding survival probability 
analytically intractable. However, the linear dynamics of re^ 2 ^ in (31) allows to obtain the following 
deterministic estimator for future mortality rates. 

Lemma 1. Let Z(s) = (s),£^(s)}. Under the Chen-Cox model, the following holds: 

E[re^(/) | Z{s)\ = re^(s) + (t — s)p^ + p^p — ^ 2 -*(s), 0 < s < t < oo. (32) 

The proof can be found in Appendix B. Substituting (32) into (30) yields the following 
estimator for E[m(T + s,x) \ ^(T), ^ 2 \T)} : 

m(T + s,x) = exp ^/3^(x) + p( 2 \x) ^ 2 \T) + sp 1 ' 1 ' 1 + p^p — ^( 2 )(T)^ . (33) 

4-1. Results 

We follow Chen and Cox [14] in using US mortality data obtained from the National Center 
for Health Statistics (NCHS) 2 . This dataset contains yearly age specific death rates for overall US 


2 Source: http://www.cdc.gov/nchs/nvss/mortality_tables.htm 
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population over 1900-2003. Fitting yields the random-walk parameters //B) = —0.2173, <tB) = 
0.3733 in (31), as well as the estimated probability of shock as p = 0.0436, with jump distribution 
(// 2 ) ; <j( 2 )) = (0.8393,1.4316). As expected, /i^ 2 ) 3> 0 is large and positive, so shocks correspond to 
large temporary increases in mortality. The goal is to analyze and compare the ability of kriging 
models and analytic estimates to predict T = 10-year deferred annuity values for unisex x = 65 


olds. Payments 

are cut-off at 

age x = 94 

. We use a 

discount rate of r = 4%. 



N tr = 

125 

N tr = 

= 512 

N tr = 

1000 

Type 

Bias 

VIMSE 

Bias 

VIMSE 

Bias 

VIMSE 

Analytic 

1.668e-03 

2.148e-03 

1.668e-03 

2.148e-03 

1.668e-03 

2.148e-03 

Ord. Kriging 

5.145e-03 

5.923e-03 

1.582e-04 

1.975e-03 

-1.999e-04 

1.634e-03 

Univ. Kriging 

5.832e-03 

6.059e-03 

4.816e-04 

1.045e-03 

-1.243e-05 

7.428e-04 


Table 1: Comparing estimators for life annuity value under the Chen-Cox model for different size of experimental 
design. The design T) is constructed with Nt r = A^f/i ■ NH\. The reported values are evaluated from a Monte Carlo 
benchmark, using (28) and (29). Analytic estimate is based on (33); universal kriging model uses first-order linear 
basis functions. 

W fit emulators with budgets N tr £ {125,512,1000}. The respective training designs V are 
deterministic and uniformly spaced across an appropriately chosen interval D = [ac, k] ; a fixed 
design minimizes Monte Carlo noise in fitting /(•)• Because Z = k^ 2 ) is just one-dimensional, a 
relatively small training budget is used. For the emulators, we fit both an ordinary kriging (OK) 
model with constant trend /x(/c) = /3q, and first-order linear universal kriging (UK) model with 
fi(n) = /?o + Pin- For evaluation, we fix a testing set containing N out = 50 values of Z(T ), each 
with a Monte Carlo benchmark containing N tn = 10 5 simulations. Due to the very small MSE’s 
involved, a very high-fidelity benchmark was needed (in order to isolate the MSE of the emulator 
from the MSE of the benchmark), leading to a very large N tn . To be computationally feasible, we 
picked a small testing set. To make sure that T> test accurately represents the distribution of Z(T) 
its locations were picked as the empirical 1%, 3%,..., 99% percentiles of a large sample of Z(T). 
The mortality shocks associated with these percentiles were used in the comparison process. 

Table 1 and Figure 1 summarize the results. We observe that there is quite a wide spread 
in potential future annuity prices, with differences of more than 10% (or $1 in annuity NPY) 
depending on realized Z(T). This confirms the significant level of longevity risk. As shown in the 
Figure 1 , there is a nearly linear relationship for z H > a(z,T,x), which is perhaps surprising given 
the above range of forecasts. This strong linear trend in the response partly explains the advantage 
of the UK model over OK. The Figure also reflects the effect of training set size and distribution: 
the Nt r = 512 model performs significantly better than its A% = 125 counterpart. We see that all 
methods perform well, with IMSE’s on the order of le-03. Even though the computed biases are 
rather small, we remark that since pension portfolios have very large face values, the corresponding 
approximation errors could be financially meaningful. For example, for a modest pension fund with 
an obligation of $ 100mm, a bias of le-03 implies inaccuracy of $100k. 

The right panel of Figure 1 provides a zoomed-in visualization of the estimators’ bias relative 
to a MC . As expected, the analytic estimator based on Lemma 1 overestimates the true annuity 
value for all k,( 2 \T). For = 125, the kriging emulator clearly has a larger MSE, and in this 
case typically under-estimates cl(k( 2 \T)). For = 512 we observe the statistical learning taking 
place, as the kriging model now has an excellent fit in the middle of the plot and essentially zero 
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25x5 



Figure 1: Annuity emulators in the Chen-Cox model. Left: three estimators (MC, UK w/N tr = 125 and analytic) of 
annuity value a(«;^ 2) (T)) vs. /d 2) (T’). The training design (indicated by the vertical dashed lines) is V = £ 

(—17.5,-10)} with Ntr, l = 25, Ntr ,2 = 5. Right: relative annuity values vis-a-vis the Monte Carlo benchmark a MC 
obtained with Ni n = 10 5 . 

bias averaging over potential values of k^ 2 \T). The effect of larger training budget is confirmed in 
Table 1, with IMSE’s all decreasing towards zero as Nt r increases. 

The above analysis demonstrates that in some settings, the shape 2 i— > a(z , T, x ) is sufficiently 
simple that little modeling is required, and analytic estimators perform well (as do statistical 
emulators). However, we stress that there is no easy way to tell a priori that the analytic estimator 
would be adequate, and in any case a sufficiently large training set size will guarantee a better 
predictive power for the kriging models. 

The one dimensional case also provides a visual representation of the effect of grid design, 
illustrated in Figure 2. The figure showcases two features of emulators: (i) dependence between 
local accuracy as measured by s 2 (z ) and grid size N tr ; and (ii) dependence between s 2 (z) and 
grid shape. First, larger training sets improve accuracy (with a general relationship of 0(N tr ' ) 
like in plain Monte Carlo). This can be seen in Figure 2 where kriging standard deviation s(z) is 
consistently lower for = 1000 compared to = 125. One implication is that as Nt r -A oo, 
we would have s 2 (z) -A 0 , i.e. /(•) would be learned with complete precision, a property known 
as global consistency of the emulator. Second, s 2 (z) is affected by the shape of V in the sense 
that higher local density of training points lowers the local posterior variance. This is intuitive if 
viewing / as an interpolator or kernel regressor - the denser the training set around z, the better 
we are able to infer f(z). Consequently, the empirical grid T>^ that is concentrated around the 
mode of Z(T), offers better accuracy in that neighborhood (around k,^ 2 \T) ~ —14 in Figure 2) 
compared to the uniform T>^ B \ but lower accuracy towards the edges, where becomes sparser. 
For all designs, posterior uncertainly deteriorates markedly as we migrate outside of the training 
set (e.g. k( 2 \T) > —11 in the Figure). 

The s(z) values shown in Figure 2 also provide an approximation of emulator IMSE. For 
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Figure 2: Effect of training design V on the emulator accuracy in the Chen-Cox model. We show kriging standard 
deviation s(z) for the universal kriging model with three different designs: (small uniform), (large uniform) 

and (large empirical). The deterministic designs contain uniformly spaced values of ac^ 2) (T) G 

(—17.5,10), of size = 125 and = 1000 respectively. is an empirical design of size Nj:^ = 1000 

generated using the density of kS 2 '* (T)\E 2 ' > (0). 


example, averaging the kriging standard deviation s(z) over the testing set using the UK model with 
Ntr = 1000 yields SAve = \J{w~; Yin s 2 (z^ n '>)} = 7.159e — 03, while in Table 1 the corresponding 

reported IMSE was IMSE = 7.428e — 04. Reasons for the mismatch include the residual MSE in 
the Monte Carlo estimate a MC and model mis-specihcation of the UK model, which would bias 
the self-assessed accuracy. Moreover, the strong correlation between a(z) across different testing 
locations z^ implies that IMSE has a large standard error. Nevertheless, SAve is a highly useful 
metric that allows to quantify the relative accuracy of different emulators in the absence of any 
gold-standard benchmarks. 

5. Case Study: Hedging an Index-Based Fund in a Two-Population Model 

There has been a lot of recent discussion regarding index-based longevity funds. Information 
on the death rates of the general public is widely available, and a market fund that uses the 
respective death rats as its price index offers a standardized way to measure population longevity. 
In particular, it allows for securitization of longevity swaps that can be used by pension funds to 
hedge their longevity risk exposure. If the pension fund could buy as many units of the swap as 
it has to pay out to its annuitants, it would result in a situation where the amount paid is nearly 
equal to the amount received from the swap. The quality of such as hedge is driven by the basis 
risk between the indexed population and the annuitant pool, that is typically a subset of the index. 
Consequently, it is necessary to create a model to capture the link between the index and the 
insured sub-population. 
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Remark. From a different angle, some longevity products explicitly integrate mortality experience 
in several regions, for example across different countries (UK, Germany, Netherlands) or across 
different constituencies (England vis-a-vis Great Britain). Lin et al. [30] states that most mortality 
data reported by official agencies calculate a weighted average mortality index of different under¬ 
lying populations. They also investigate the modeling aspect of such multi-population indices. 

To fix ideas, we call the index population Pool 1, and the annuitants Pool 2. Consider now an 
individual from Pool 2 who will be aged x at date T when she begin to receive her life annuity. The 
corresponding time-T liability to the pension fund is denoted a 2 (Z(T),T,x). If the pension fund 
enters into a swap based on the index, she might purchase ir index-fund annuities for age x , with 
net present value of irai(Z(T), T, x), at T. For now we ignore what would be a fixed premium. 

The overall hedge portfolio is then A (Z(T),T,x) = irai(Z(T),T,x) — a 2 (Z(T),T,x). Several risk 
measures can be used to determine hedge effectiveness. Some examples include variance, or tail risk 
measures such as value-at-risk (VaR) or expected shortfall (TVaR). Recent work in this direction 
includes Coughlan et al. [15] who used a bootstrapping and extrapolation method to analyze hedge 
effectiveness, and Cairns et al. [13] whose setup we follow below. 

Unsurprisingly, the correlation structure for mortality across populations is complex. One 
notable recent contribution is by Cairns et al. [ 12 , 13] who considered a hedging problem between 
an index pool k = 1 and insured sub-pool k = 2 . Specifically, the two populations are the England 
& Wales (E&W) general population, which represents the index mortality rate (Pool 1), and the 
Continuous Mortality Investigation (CMI) population, which are mortality rates gathered from 
United Kingdom insured populations, serving the role of those receiving pension payments (Pool 
2 ). To model the dependence between the two pools, Cairns et al. [ 12 ] proposed a cointegrated 
two-population Bayesian model based on the Lee-Carter framework. To wit, the mortality rates 
mk(t,x ) behave similar to ( 11 ), 

log m k (t,x) = /3 < j}\x) + n- 1 n ( £\t)+'y^\t-x), k = 1,2 
with stochastic dynamics of the period effect re® given by 

«i 2 \t) = 2 \t - 1) + in + criei(t), ei(t) ‘~ d N(0, 1). (34) 

In turn, the mortality of the larger population influences the period effect of the smaller (insured) 

population, with dynamics for re® co-integrated with re®. Namely, their difference S(t) = re® (t) — 

(2) 

re) (t) forms an AR(1) process 

S(t) = 112 + <!>{S{t - 1) - H 2 ) + cr 2 e 2 (t - 1) + cei(t - 1 ), e 2 (i) ~ JV(0,1), (35) 

/o') (o) 

with ei(-) being independent of e 2 (-), and c = oi—pu 2 for the covariance, where p = Corr(re) (t), n y 2 {t )). 
In both models cohort effects ; are independent AR{ 2) processes. Here, (f> is the mean-reversion 
rate. Since (35) models the difference re)" (t) — re) (t). (f> reflects the rapidity of how S(t) returns 
to the constant /j 2 , which is assumed to be the constant difference between the two populations. 

5.1. Analytic Approximations 

Cairns et al. [13] used the fact that E[re® (T + t) \ re[ 2 '(T)] = k[ 2 \t) + pit to introduce the 
median-mortality approximation 

m^ l (T + t,x + t) = exp (x + t) + — (re® (T) + pit ) + — 7 ® (T - x)') . (36) 

V n a n a ) 
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Since S(t) is mean reverting, it is also suggested to use the approximation for the CMI population 
of 


(T + t,x + t) = exp ( /3^ (x + t) + — (T) + fi\t) + —■ 7 ^ (T - x) 

IT’a fla 


(37) 


i.e. the same drift as the general population but different initial value. 

We introduce a different, more accurate approximation based on the following lemma. 

Lemma 2. We have 


E 


4 2) (T + t)\Z(t)] = k™(T) + - H 2 O- - </>*) - ~ 


(38) 


( 2 ^ 

The proof can be found in Appendix B. Denote E[/d> (T+t)\Z(t)\ = £(i, T). Lemma 2 suggests 
an alternative analytic estimator for m 2 {T + s,x) as 


mf(T + s, x + s) = exp ( j3^\x + s) + £(t, T) + — 72 (T - x) ) . 


(39) 


Denote ai(Z(T)) and 02 (Z(T)) as the net present value at T (conditional on Z(T)) of a life 
annuity for the E&W and CMI populations respectively as defined in (9). In what follows, Analytic 
1 will refer to use of (36) and (37) in estimating survival probabilities (10) for each population 
(and hence a\ and 02 ), while Analytic 2 refers to the use of (36) and (39). Notation for deferred 
annuity values under the two analytic approaches will be l (z) and a^ 2 (z), k = 1 , 2 . 


5.2. Model Fitting 

The parameters P^\x), /3 ^(.x), and past trajectories K^\t), 7 ^\t — x), for k = 1, 2 were 
estimated from the male E&W and CMI populations respectively, and the time and age ranging 
from calendar years 1961 to 2005 (with 2005 treated as t = 0), and x from 50 to 89. The pro- 
cesses (k\ (t )) and (S(t)) were fit as random walk with drift and AR( 1) respectively, introducing 
additional parameter estimates for m, 07, H2, <f>, 02 and c. We find //1 = —0.5504, H2 = 0.6105, = 

1.278, (72 = 0.568, 4> = 0.9407, c = 0.262, so that the CMI population tends to have higher mortality, 
with a co-integration of about 94%. 

Using the PPC approach of [13], we treat the age-effect parameters as fixed, and refit the 
ARIMA models at period T for each simulation. That is, the /3^ are fixed throughout for 
k = 1,2 and each of /xi, < 72 , H 2 , <f), 07 , and c are re-estimated. In principle, this makes the re- 
estimated parameters part of the state variable Z(T). A few preliminary runs indicate that the 
variance parameters (J \, 07 and c have little significant effect on annuity values, while fii , /J 2 and 
(j) do. Since y\ is in one-to-one correspondence with k^(T), our time T state process is finally 
characterized as 

Z(T) = {4 2) (T),4 2) (T),/x 2 ,</>}. 

Heuristically, this is a reasonable choice: each element of Z(T) has a direct effect on the time T 
mortality rates or their trends, while the variance terms simply add variability. 

Several stochastic mortality models have R code available 3 for model fitting. We use the code to 
fit the two-population model parameters, yielding the inferred past trajectories for the age, period, 


3 LifeMetrics Open Source R code for Stochastic Mortality Modelling; see http://www.macs.hw.ac.uk/~andrewc/ 
lif emetrics/ for details 
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and cohort effects. In a separate step, the estimated period and cohort effects are modeled as 
individual ARIMA models. 

For the remainder of this section we assume the starting age of the annuitant is x = 65 with 
a fixed interest rate of r = 0.04 and a T = 10 year deferral period. Generally the choice of hedge 
ratio 7r is chosen systematically, for example through minimizing variance. In this paper we assume 
the neutral value of 7r = 1 in order to not favor one estimation type over another. Hence the value 
of the hedge portfolio is A (Z(T)) = a\{Z{T)) — a, 2 {Z(T)). 

As discussed in Section 3.3, determining the training set design depends on the problem at 
hand. In our particular example with a 4-dim Z(T), we aim to give an accurate result of the 
expectation of the hedge portfolio A(T), so we use an empirical design, as suggested in Section 

3.3. This also holds the advantage of capturing the correlation between and which is 
important in this co-integrated model. To compare the effect of budget size, we choose two different 
budgets, Ntr = 1000 and Nt r = 8000. Following the framework in Section 3.3, Nt r is allocated into 
N tr ,i = N tr ,2 = A) 1 / 3 , so that we have N tr ,i = 100 (resp. N tr ,i = 400) training points with 
Monte Carlo simulations containing Nt. r , 2 = 10 (resp. Ntr, 2 = 20) batched simulations for each 
design point. 

Different surrogate models are chosen than in Section 4; this time around a multi-dimensional 
state process suggests the use of a TPS model from Section 3.1. We forego the OK model, but 
maintain use of the lst-order linear UK model, and also implement a simple kriging (SK) model with 
[i(z) = a/ 2 (z) — at/ 2 (z). This combines advantages from both the analytic and UK approach, giving 
us an already accurate estimate for the trend, while nonparametrically modeling the residuals. 
For these reasons, a SK emulator should outperform both the analytic estimators and the UK 
model. We utilize another advantage of the surrogate models and fit them directly to the hedge 
portfolio values A (Z(T)) rather than individually modeling annuity values cik{Z(T )) and then 
taking difference of two approximations. 

The Monte Carlo benchmark yields an average portfolio value of 0.1995 with a standard de¬ 
viation of 0.1067. This suggests that a one-to-one purchase of index annuity is not the optimal 
hedge unit under this population model. In an actual application, one would analyze further to 
determine a different choice for ir; for example Cairns et al. [13] chooses 7r to minimize portfolio 
variance. 

5.3. Results 


Type 

N tr = 

Bias 

1000 

VIMSE 

N tr = 

Bias 

8000 

VIMSE 

Analytic Al from (37) 

-2.101e-02 

3.460e-02 

-2.101e-02 

3.460e-02 

Analytic A2 from (39) 

4.480e-03 

5.321e-03 

4.480e-03 

5.321e-03 

Thin Plate Spline 

2.577e-03 

1.304e-02 

5.803e-04 

5.095e-03 

Universal Kriging 

4.363e-04 

1.856e-02 

1.857e-03 

1.289e-02 

Simple Kriging 

-1.334e-03 

3.280e-03 

9.390e-04 

3.043e-03 


Table 2: Performance of analytic estimates and surrogate models for hedge portfolio values in the two-population 
model case study. Numbers reported are based on N ou t = 1000 simulations of Z(T) with a Monte Carlo benchmark. 
N tr is allocated into N tr , 1 = lV t 2 / 3 training points and N tr ,2 = IV// 3 Monte Carlo batches per training point. Simple 
kriging model uses A2 estimator as trend. 
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Figure 3: Boxplots of hedge portfolio value bias for Ntr = 8000 for analytic A2 and simple kriging approaches. To 
construct the boxplot, we computed for each of 1000 simulated values of Z(T), the difference between the respective 
estimate and the Monte Carlo benchmark. 


We choose N out = 1000 simulations of Z(T) and predict hedge portfolio values A (Z(T)) with 
the surrogate models, as well as via the deterministic estimates. Table 2 shows the results. As 
expected, the Analytic A2 estimator outperforms Analytic A1 since it is catered directly to the 
two-population model. Relative to Al, our improved estimator cuts bias by nearly 80%. As for 
the surrogates, when Nt r = 1000, each of the TPS and UK models only slightly underperform the 
analytic estimate A2, while the SK model does significantly better. For = 8000, both TPS and 
SK are better than A2. 

Figure 3 summarizes the empirical distribution of the bias of the A2 and SK estimators given 
simulations of Z(T). We can see that both approaches have similar variability, while SK has a 
much lower bias. The UK and TPS estimators have similar distributions with slightly larger bias 
than SK. 

There are a few comments to be made in regards to these results. First of all, there is no 
way to tell a priori that a deterministic estimate will perform well. For example each surrogate 
model completely outclasses Al, while TPS and UK perform only marginally better than A2. 
Possibly, even better (or worse) analytic estimators can be derived. Additionally, the deterministic 
estimators are for annuity values themselves and not for the portfolio difference A(T). A lower 
bias for a A(T) could simply be a consequence of the bias of each annuity a,k(Z(T)) being reduced 
during subtraction. 


6. Case Study: Predicting Annuity Values under the CBD Framework 


6.1. Model Fitting 


Our third case study utilizes another popular class of mortality models, the CBD [9] models 
which directly work with the survival probabilities P. To wit, we model the 1-period survival 
probability 


P{Z(T)-,T, 1, x) 


1 

1 + exp [k^(T) + (x~ XAve)^ 2 \T)) ’ 


(40) 


where XAve = n a 1 Xj, and k^ follow ARIMA models, which according to Cairns et al. [10] 

provide a good fit for period effects. Multi-period survival probabilities are obtained as products 
of (40). 
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We fit (40) to the CMI population, considering a full range of ARIMA(p,d,q ) models with 
p,q = 0,1, 2,3,4 and d = 0,1,2, using auto.arima in R from the package “forecast” [23]. The 
optimal configuration for this population is for to follow ARIMA(0, 1,3) with drift and P 2 ^ 
to follow ARIMA(1 ,1, 2): 


3 

K (!) (f) = « (1) (t - 1) + n + e (1) (t) + 6> fel) e (1) (t-q), (41) 

q =i 

P 2 \t) = (1 + (j))P 2 \t — 1) — c/)K^ 2 \t — 2) + e^ 2 \t) + ^ #(<2> 2 ) e ( 2 )(£ _ g). (42) 

q =i 

The estimated ARIMA parameters are n = — 0.0195, cj> = 0.9206, = —0.5516, = 

0.1736, = 0.5169, 6 {2 ^ = -1.4664, 9^ = 0.6167. The 6 { ^ k \k = 1,2 describe how past 

errors echo into future values of P k \ For example, the large negative value of t^ 2,1 -* means that the 
noise generated in reW (s) will be amplified, made negative, and added to the future kW(s + 2). The 
above equations imply that the state has three components, Z(T) = {k^(T), P 2 \T), P 2 \T — 1)}. 

As in the previous case studies, we develop a deterministic estimate for survival probabilities. 
Denote by ^ k \t,s ) = E[K^(f) | Z(s)\ for k = 1,2. The expressions for are as follows. 


Lemma 3. The following hold for t > s 
£ ( 1 ) (t,s) = k ( 1 ) (s) + p{t- s); 




<fP 2 \s — 1) — P 2 \s) 


(43) 

(44) 


The proof can be found in Appendix B.3. Based on Lemma 3, and substituting expected 
values of P k \s) into (40) we obtain a deterministic estimate of the a-year survival probability as 
the product 

M—i j 

P ( Z(s),t,u,x) n 1 + exp (^(i)(t + j,a) + (x + j-x i 4ve)^(2)(i + j,s))' 

Through equation (9), this yields the estimate for the T —year deferred annuity given Z(T ) : 


x—x 

d det (Z(T)-T , x) = J2 e~ rs P(Z(T),T, s, x), (45) 

S=1 

where the cutoff age is x = 89. 

We proceed to value life annuities in the above model. In contrast to the first two case stud¬ 
ies, we extend the deferral period to twenty years. An additional ten years of evolution imbues 
significant uncertainty into the mortality state Z(T). We use an empirical training design V for 
this case study for two reasons; one being that the correlation structure is problematic with any 
organized grid. From (42), we see that P 2 \20) and k^( 19) should be strongly correlated, while 
both k^ 2 )(19) and k( 2 )(20) are independent of /td)(20). Secondly, the long deferral period causes 
significant variation in the distribution of Z(20), and with expectation in mind, we desire the 
accurate capturing of the density of Z( 20) that the empirical grid will provide. The algorithms 
discussed in Sections 3.3 and 3.4 are used to generate the design and fit the surrogate models. As 
in Section 4, we choose an ordinary kriging and lst-order linear universal kriging models, and also 
fit a thin plate spline model as used in Section 5. 
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6.2. Results 


Type 

N tr = 

Bias 

1000 

VIMSE 

Ntr = 
Bias 

8000 

VIMSE 

Analytic 

-4.560e-01 

5.257e-01 

-4.560e-01 

5.257e-01 

Spline 

-2.358e-02 

6.719e-02 

4.195e-03 

5.436e-02 

Ord. Kriging 

3.669e-03 

9.785e-02 

9.734e-03 

7.743e-02 

Univ. Kriging 

-1.785e-03 

5.844e-02 

5.635e-03 

4.355e-02 


Table 3: Performance of analytic estimates and surrogate models for 20-year deferred annuity values under the CBD 
framework. Numbers reported are based on N ou t = 1000 draws of Z('20). Nt r is allocated into Ntr, l = JVf/ 3 training 
points and Ntr ,2 = Ay/ 3 Monte Carlo batches per training point. Analytic estimate refers to (45), and Spline to thin 
plate spline (TPS) model. Universal kriging model uses linear basis functions. 

In contrast to the results in Sections 4.1 and 5.3, Table 3 shows that the analytic estimator 
(45) crumbles under this volatile model and long deferral period. On the other hand, both kriging 
models produce reasonable results even with Nt r = 1000. We can also observe a diminished effect 
of increasing the training set size, due to the increased model variance. 

These results reflect the comments made in the previous sections: the analytic estimate is a 
parametric guess as to what may provide an accurate result, and that guess is not always correct. 
Our analytic choice in this case study was derived along identical lines as to the analytic estimates 
in the other case studies, yet performs substantially worse. In comparison, the statistical learning 
frameworks provide a reliable estimator even in a volatile model with a three-dimensional state 
process and long deferral period. 

7. Conclusion 

The three case studies above showcase the flexibility and admirable performance of the surrogate 
models across a range of various longevity risk dynamics. Compared to the consistent accuracy of 
the statistical emulators, the quality of the deterministic projections was widely varying. Because 
an analytic derivation is required to produce a deterministic estimator, there are several plausible 
estimators available. In Section 5 we derived two different estimators, both of which were viable, 
but one underperformed. Similarly, in Section 6 the derived deterministic projection was also 
inaccurate. Overall, these examples show that our models can outperform deterministic projections 
and provide minimally-biased estimates. 

Relative to the analytic estimates, the case studies in section 5 required the training set size 
to be sufficiently large, while in Sections 4-6 a small design of less than 10 3 simulations yielded 
accurate models. In terms of individual model performance, it was without surprise that the 
SK model in Section 5 produced the best results. The analytic estimate was already performing 
adequately, and the SK model used it as its trend component to improve accuracy even more. The 
downside to this is that a well-performing deterministic estimate was required. 

Throughout the paper we have hinted at possibilities for further work. Straightforward exten¬ 
sions include using other mortality models, or emulating other insurance products, e.g. variable 
life annuities. One can also build more dynamic surrogates that treat initial age x (fixed in our 
case studies as x = 65) or deferral period T as part of the state Z , providing a joint prediction for 
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(Z(0),T,x) i—^ E [a(Z(T),T,x)\. Similarly, one could consider more parameter uncertainty which 
would lead to including additional components in the state Z. 

The emulators we obtained for a{Z(T),T,x) offer a high-performance tool for annuity risk 
management. Indeed, they are based on advanced, previously vetted stochastic mortality models 
and calibrated to real, reliable, large-scale mortality datasets. Hence, the fitted estimates for 
annuity values are in essence a best-available forecast that combines state-of-the-art longevity 
modeling, data calibration and statistical model. As such, (after incorporating age and interest 
rate as model parameters) they would be of independent interest to actuaries working in longevity 
space and seeking easy-to-use tools for forecasting net present values of life annuities. The emulator 
offers a plug-and-play functionality, converting inputted parameters (such as age x, deferral period 
T and discount rate r) into the annuity value (note that the initial state Z{ 0) is read off from 
the calibration procedure). One can imagine building a library of such emulators for different 
mortality-contingent products available in the marketplace. 

Looking more broadly, the emulation approach we propose is very general and can be applied in 
a variety of actuarial contexts. In particular, in future work we plan to extend it to the microscopic 
agent-based models of mortality Barrieu et al. [2] which offer a canonical “complex system” repre¬ 
sentation of population longevity. We believe that emulators could significantly simplify predictions 
in these types of models by providing a tractable, statistical representation of demographic interac¬ 
tions within a stochastic dynamic population framework. Another class of insurance applications 
requires functional-regression tools where emulators can again be very effective [20]. A different 
extension is emulation of risk measures related to F(T, Z(-)), such as VaR or TVaR, which re¬ 
quire targeted surrogates that focus on a specific region of the input space. A starting point is to 
combine concept of importance sampling to generate a targeted design V that e.g. preferentially 
concentrates on the left tail of F. 
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Appendix A. Lee Carter & CBD Stochastic Mortality Models 

In this section we give a brief summary of existing stochastic mortality models. We use the 
notation of Cairns et al. [10] who provided a comprehensive comparison of several mortality models 
using CMI data. 

The APC Lee-Carter model (introduced by Renshaw and Haberman [34]) models the log mor¬ 
tality rate as 

log m(t,x) = + (5 <y2 \x)K ( ' 2 \t) + /S*- 3 '* (x) 7 ® (t — x). (M2) 

One can interpret /3^\x), n^ 2 \t) and 7 ® as the age, period and cohort effects, respectively. The 
original model proposed by Lee and Carter [28] is a special case where 7 ® = 0. The age effects 
/3^ k \x),k = 1,2,3 are estimated (non-parameterically) from historical data, while the period and 
cohort effects are taken as stochastic processes. In the original proposal in [28], the period effect 
is assumed to follow a random walk (i.e. unit root AR( 1) in discrete time), 

K W(t) = ^Xt-l) + ^+aWeV\ 

where // 2 ) is the drift, is the volatility, and ~ 1V(0,1) i.i.d. is the noise term. Alternatively, 
Cairns et al. [10] mention that ARIMA models may provide a better fit, in particular fitting an 
ARIMA{ 1,1,0) process for n^ based on 2007 CMI dataset. 

For the cohort effect, Renshaw and Haberman [34] suggested using ARIMA models for 7 — 
x)\ Cairns et al. [10] recommend the use of either ARIMA( 0,2,1) or ARIMA{ 1,1,0). Renshaw 
and Haberman [34] and Cairns et al. [10] both assume 7 ^ is independent of . 

This model has identifiability issues, and one set of constraints could be 

J> (2) (*) = 0, E^ (2) ^) = 0 ’ j> (3) (*-*) = o, and E^ (3) ^ = 1 - 

t X X,t X 

From a different perspective, Cairns, Blake, and Dowd [9] (CBD) proposed a model for q(t , x) = 
1 — P(Z(0);t, l,x), the probability of death in year t for someone aged x. Namely, they use 

logit q(t,x) = (3^ (x) K^\t) + /3( 2 \x)rS 2 \t), (M5) 

where logit(y) = log • 

If we let n a be the number of ages available in the data set for fitting, and take XAve = n ~ 1 Xj, 

the commonly used parameterization for the CBD model (M5) is 

/3^\x) = 1, and (3^ 2 \x) = x — XAve■ (A.l) 

Under these assumptions there are no identifiability issues. 

Appendix B. Proofs of Analytic Estimates 

Appendix B.l. Proof of Lemma 1. 

Since the noise terms ^ k \u ) are independent of k(s) for s, taking conditional expectation 
with respect to Z(s) = {/^-^( s ), £^( s )}, and writing in terms of the increments k(u) — k(u — 1) 
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yields 


E [n(t) — k(s ) | Z(s)] = ® [«(«) — K ( u ~ 1) I Z(s)] 

u=s-\-l 

t 

= E [c (1) («) + ^ 2 \u) - ^ 2 \u - 1) I Z(s)] . (B.l) 

IL=S +1 

By the independence assumption we have for ti/s + 1 

E |£ (1) (u) | Z(s)] = |U (1) (B.2) 

E £ (2) (u) -? (2) (u- 1) | Z(s) = ju (2) p - p (2) p = 0. (B.3) 

For u = s + 1, 

E [e (2) (s + 1) - £ (2 ) (s) | Z(s) 1 = ^ 2) p - £ (2) (s). (B.4) 

Combining (B.1)-(B.4), we obtain 

E [n(t) | Z(s)\ = k(s) + (t - s+ p^p - 2 \s ). (B.5) 

□ 

Appendix B.2. Proof of Lemma 2. 

Since k\ has trend // 1 , E[A>q(f) — K\(t— 1)] = p\. and using conditional independence, we obtain, 


E [ki(T + t) | Z(T)} = n\(T) + pit. (B.6) 

For the co-integration term S(t), the expected values satisfy 

E [S(T + t) | Z(T)} =p 2 + (j) (E[5(T + t - 1) | S(T)\ - p 2 ). (B.7) 

The above gives a recursive equation for t *-)• E[5(T + t) \ Z(T)\, with initial condition E[5(T + 0) | 
Z(T)\ = S(T), which can be solved to yield 

E [S(T +1 ) | Z(T)\ = /x 2 (l - </>*) + tfSiT). (B.8) 

Finally, using n 2 (t) = K\(t) — S(t), and combining (B.6) with (B.8) leads to 

E[k 2 (T + t) | Z(T)\ = «i(T) + mt - (/x 2 (l - tf) + 4? [«i (T) - k 2 (T)]) . 

as desired. □ 
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Appendix B. 3. Proof of Lemma 3 

For £W(t, s ), kW i s no different than a random walk with drift, so we have 

E[/t ( ' 1 ' ) (t) | = /«W(s) + n(t — s), s < t. 

Next, we take expectation on both sides of (42) to obtain the recursive relation 

E[ K (2 )(t) | Z{s)\ = (1 + (f)E [«( 2 )(t - 1) | Z(s)\ - (fE[^ 2 \t - 2) | Z{s)\ (B.9) 

where Z(.s) = (k^(s), k( 2 ) ( s ) , k( 2 \s — 1)}. Equation (B.9) is a recursive relation in t with general 
solution 

E[n^(t)\Z(s)\= 0 ^ + 02 , (B.10) 

where the constants ci and C 2 are to be determined. Plugging-in the initial conditions 

ci f> s + C 2 = E[/t^(s) | Z(s )] = k^ 2 \s), and (B.ll) 

ci<f s+l + C 2 = E[ac^(s + 1) | Z(s)] = (1 + ^>)E[/t ( ' 2 ^(s) — cf)K^ 2 \s — 1) | Z(s )] 

= (1 + (/))k^ 2 \s) — c/)k( 2 \s — 1). (B.12) 


and solving for ci, C 2 we obtain 

i_ k^ 2 \s) — n( 2 \s — 1 ) <fr^ 2 \s — 1 ) — nS 2 \s) 

C1 = 4, -- • « = -- 

Finally, combining (B.13) with (B.10), we arrive at (44). 


(B.13) 

□ 
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